DebugCollector: first class file archival #9555

davepacheco · 2025-12-20T04:35:37Z

This PR adds a more first-class file archival mechanism inside the debug collector within sled agent. The reason I did this is that in the past when I wanted to modify the files that Sled Agent collects, I found it tricky to do because:

it's hard to add support for collecting new kinds of files without touching the code related to collecting other files
that code combines both decision-making and execution logic, which makes it tricky to change
the tests for it are good end-to-end tests but very coarse

Altogether I basically felt like even when making a pretty small change, you essentially had to test in a real deployment, which is a much slower dev workflow than it needs to be (and it'd be very easy to break without breaking CI).

This is coming up because I'm planning to implement RFD 613 Debug Dropbox shortly.

After this PR, there's a new file_archiver module:

There's a list of rules (in the rules submodule) that describe what files to collect. I hope it's easy to add new things to this.
Archival is now implemented with the plan-execute pattern. Nearly all the behavior is in the planner so that it can be exhaustively tested without having to actually construct directory trees. (There's still an end-to-end smoke test that does actually verify what happens with files on disk.)
There's a pretty comprehensive test suite that's based on a list of paths found on real systems (from dogfood). It includes checks to make sure the dataset itself covers all the different rules so that if we extend the rules but forget to update the test data, it will fail tests.

It's arguably overengineered at this point but I'm hopeful that this will make it a lot easier to augment the set of files that get archived in this way.

As a first step, I tried to preserve the existing behavior as much as possible. There are several oddities that we might want to fix in follow-up work:

There's no debouncing, meaning that it's possible that the system could try to archive a rotated log file or core file or crash dump while it's still being written, resulting in losing the original file and having only a partial copy in the debug dataset.
I need to double-check this but it looks to me like the existing implementation overwrites existing ~~crash dumps and~~ core files. This would be buggy but not a huge deal for core files because their names are pretty unusual (they have pids and execnames and zonenames in them)~~, but it seems likely to result in at most one crash dump kept per sled ever since they'll all be called vmdump.0~~.
Rotated log files include their original mtime as a Unix timestamp in the filename, but if there's already a file with that name, then the mtime is incremented until we find a file that does not exist. This is a little weird because it means this value is close to the mtime but not actually the mtime. (One behavior change in this PR is that we'll only check a max number of possible filenames -- 30 -- after that, we give up.)
Live log files wind up being called something.mtime instead of something.log.mtime the way that rotated log files do. This isn't a huge deal but does break oxlog (oxlog does not find archived live log files #9271). I'm not sure what we should do here. We could use the same convention but then we'd lose the distinction between live vs. rotated log files. I'm not sure if that's important.

davepacheco · 2025-12-20T04:38:05Z

Marking this draft because I still want to do some testing on a4x2 or a racklette.

davepacheco · 2025-12-22T23:29:08Z

I deployed this to london and:

induced a panic on sled 15
induced a process core dump on sled 14

Regarding sled 14, I did briefly see the process core dump here:

BRM42220036 # find /pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/crash
/pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/crash
/pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/crash/core.global.sleep.18626.1766442497
BRM42220036

As expected, that disappeared and the archived copy showed up:

BRM42220036 # find /pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/crash
/pool/int/a15c936d-bf01-4411-aab2-d401b66910c6/crash
BRM42220036 # ls /pool/ext/*/crypt/debug/core*
/pool/ext/0544a89a-a2f8-494c-ace8-099b7fbd4e13/crypt/debug/core.global.sleep.18626.1766442497

On the sled where I induced a system panic:

BRM42220030 # ls -l /pool/ext/*/crypt/debug | grep -v ^d
/pool/ext/0990c846-28ad-40f8-bb73-5b19d4d85771/crypt/debug:
total 992

/pool/ext/6167e7b2-5a4c-420e-aa1f-8fc1be95d639/crypt/debug:
total 10409274
-rw-r--r--   1 root     root           2 Dec 28  1986 bounds
-rw-r--r--   1 root     root     10682171392 Dec 28  1986 vmdump.0

/pool/ext/6bdd15b1-bde9-4bd7-a1ea-342c0278b555/crypt/debug:
total 0

/pool/ext/72d71e4b-ddff-4cef-ac16-8707e1a732d1/crypt/debug:
total 0

/pool/ext/8b4122b0-e873-4c7c-bc6c-d2de1dfc68ba/crypt/debug:
total 0

/pool/ext/bad19632-89c0-4cd7-8430-5463dab81e4c/crypt/debug:
total 0

/pool/ext/bb7cb68b-0f7b-4994-91bc-5bd3b6527438/crypt/debug:
total 0

/pool/ext/d7edfff9-e058-4dfc-bb5f-fa1b18cb760a/crypt/debug:
total 0

/pool/ext/f47ac7ce-8c14-420c-9419-db3a3f714840/crypt/debug:
total 0

In terms of log files: back on sled 14:

BRM42220036 # ls -l /pool/ext/*/crypt/debug/
/pool/ext/0544a89a-a2f8-494c-ace8-099b7fbd4e13/crypt/debug/:
total 3154
-rw-r--r--   1 root     root     2725748 Dec 22 22:30 core.global.sleep.18626.1766442497
drwxr-xr-x   2 root     root           2 Dec 28  1986 global
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_cockroachdb_701e477d-effb-4ca5-954c-23629e59236b
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_14ae8c8d-1b04-4c29-9127-5a801540a6a4
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_1e7579a3-62e3-4377-9fe4-aa70cbbaa713
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_3cf95991-6e04-417b-b25f-f10568c2e420
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_55316099-f1d1-49de-a646-c041a86dea03
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_5af3a531-c868-4c02-a6cb-03a24ae010fc
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_5af6e97e-900d-4cc2-84ad-05210d6391fe
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_7cd97885-d96a-4e0a-9917-fa1c3a94e1bb
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_8aa08df3-6310-4c35-bf85-86c81dfe5549
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_crucible_d7d018b4-9567-46be-96e9-0f45141a9368
drwxr-xr-x   2 root     root         109 Dec 22 21:00 oxz_crucible_pantry_1a4a450b-c260-4ed8-80f6-735d26853263
drwxr-xr-x   2 root     root         110 Dec 22 21:00 oxz_external_dns_95f461ab-5307-4272-a3f3-7a3bd870792d
drwxr-xr-x   2 root     root         109 Dec 22 20:55 oxz_internal_dns_429d414f-135d-45f0-873c-2d2ec1427bb2
drwxr-xr-x   2 root     root         112 Dec 22 20:55 oxz_ntp_931d6ed6-bb5b-4afc-9b45-2fcadfd18dfa
drwxr-xr-x   2 root     root         110 Dec 22 23:00 oxz_oximeter_34208b99-36b3-4b2d-9fb1-cbd5a3f15804
drwxr-xr-x   2 root     root         119 Dec 22 20:55 oxz_switch

/pool/ext/0811ae71-4df3-49d5-9177-96ad02d13302/crypt/debug/:
total 0

/pool/ext/0a189e94-85d0-40ff-b84d-7957fb701743/crypt/debug/:
total 0

/pool/ext/26ab05fb-d580-41d4-9c89-c147ffcd2371/crypt/debug/:
total 0

/pool/ext/2f3d8653-5b88-4ccd-a91e-0e18c1098548/crypt/debug/:
total 0

/pool/ext/3510e4ab-8fff-47f5-8410-6404801ab35b/crypt/debug/:
total 0

/pool/ext/c4336926-de97-4224-8570-1fdae2e98223/crypt/debug/:
total 0

/pool/ext/c5403d97-0256-47fb-afa4-1e94479b3545/crypt/debug/:
total 0

/pool/ext/e31e6dd4-150b-4275-92c8-9fe4bc5c6f68/crypt/debug/:
total 0

BRM42220036 # ls -l /pool/ext/*/crypt/debug/oxz_crucible_14ae8c8d-1b04-4c29-9127-5a801540a6a4 | head
total 972
-rw-r--r--   1 root     root          75 Dec 22 21:00 application-management-net-snmp:default.log.1766436541
-rw-r--r--   1 root     root          30 Dec 22 21:00 application-security-tcsd:default.log.1766436536
-rw-r--r--   1 root     root         173 Dec 22 21:00 messages.1766436552
-rw-r--r--   1 root     root          80 Dec 22 21:00 milestone-devices:default.log.1766436537
-rw-r--r--   1 root     root         144 Dec 22 21:00 milestone-multi-user-server:default.log.1766436542
-rw-r--r--   1 root     root         400 Dec 22 21:00 milestone-multi-user:default.log.1766436542
-rw-r--r--   1 root     root         326 Dec 22 21:00 milestone-name-services:default.log.1766436542
-rw-r--r--   1 root     root          80 Dec 22 21:00 milestone-network:default.log.1766436537
-rw-r--r--   1 root     root         150 Dec 22 21:00 milestone-single-user:default.log.1766436542
BRM42220036 # ls -l /pool/ext/*/crypt/debug/oxz_crucible_14ae8c8d-1b04-4c29-9127-5a801540a6a4 | tail
-rw-r--r--   1 root     root         155 Dec 22 21:00 system-pkgserv:default.log.1766436538
-rw-r--r--   1 root     root          30 Dec 22 21:00 system-process-security:default.log.1766436537
-rw-r--r--   1 root     root         165 Dec 22 21:00 system-rbac:default.log.1766436540
-rw-r--r--   1 root     root         238 Dec 22 21:00 system-rmtmpfiles:default.log.1766436537
-rw-r--r--   1 root     root         179 Dec 22 21:00 system-sac:default.log.1766436542
-rw-r--r--   1 root     root          30 Dec 22 21:00 system-svc-global:default.log.1766436537
-rw-r--r--   1 root     root         206 Dec 22 21:00 system-system-log:default.log.1766436542
-rw-r--r--   1 root     root       70420 Dec 22 21:00 system-update-man-index:default.log.1766436543
-rw-r--r--   1 root     root         205 Dec 22 21:00 system-utmp:default.log.1766436542
-rw-r--r--   1 root     root          30 Dec 22 21:00 system-vtdaemon:default.log.1766436537

davepacheco · 2025-12-22T23:35:39Z

I started a VM and was surprised to immediately see some rotated files in the debug dataset, but it turns out that yes, cron had run logadm despite this zone being only minutes old:

root@oxz_propolis:~# date
Mon Dec 22 23:32:51 UTC 2025
root@oxz_propolis:~# uptime
23:32:53    up 4 min(s),  1 user,  load average: 0.13, 0.06, 0.02
root@oxz_propolis:~# cat /var/cron/log 
! *** cron started ***   pid = 25187 Mon Dec 22 23:28:54 2025
>  CMD: /usr/sbin/logadm
>  root 26006 c Mon Dec 22 23:30:00 2025
<  root 26006 c Mon Dec 22 23:30:01 2025

Out in the GZ:

BRM42220030 # ls -l /pool/ext/*/crypt/debug/oxz_propolis-server_*
total 970
-rw-r--r--   1 root     root          75 Dec 22 23:30 application-management-net-snmp:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 application-security-tcsd:default.log.1766446130
-rw-r--r--   1 root     root         180 Dec 22 23:30 messages.1766446143
-rw-r--r--   1 root     root          80 Dec 22 23:30 milestone-devices:default.log.1766446130
-rw-r--r--   1 root     root         144 Dec 22 23:30 milestone-multi-user-server:default.log.1766446134
-rw-r--r--   1 root     root         400 Dec 22 23:30 milestone-multi-user:default.log.1766446134
-rw-r--r--   1 root     root         326 Dec 22 23:30 milestone-name-services:default.log.1766446133
-rw-r--r--   1 root     root          80 Dec 22 23:30 milestone-network:default.log.1766446130
-rw-r--r--   1 root     root         150 Dec 22 23:30 milestone-single-user:default.log.1766446134
-rw-r--r--   1 root     root          80 Dec 22 23:30 milestone-sysconfig:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-chrony:default.log.1766446130
-rw-r--r--   1 root     root         163 Dec 22 23:30 network-datalink-management:default.log.1766446129
-rw-r--r--   1 root     root         179 Dec 22 23:30 network-dns-client:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-dns-install:default.log.1766446129
-rw-r--r--   1 root     root        2787 Dec 22 23:30 network-inetd-upgrade:default.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-inetd:default.log.1766446134
-rw-r--r--   1 root     root         236 Dec 22 23:30 network-initial:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-install:default.log.1766446129
-rw-r--r--   1 root     root         161 Dec 22 23:30 network-ip-interface-management:default.log.1766446130
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-ipmp:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-ipqos:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-ipsec-ike:default.log.1766446129
-rw-r--r--   1 root     root         157 Dec 22 23:30 network-ipsec-ipsecalgs:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-ipsec-manual-key:default.log.1766446129
-rw-r--r--   1 root     root         283 Dec 22 23:30 network-ipsec-policy:default.log.1766446130
-rw-r--r--   1 root     root         166 Dec 22 23:30 network-iptun:default.log.1766446130
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-ipv4-forwarding:default.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-ipv6-forwarding:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-ldap-client:default.log.1766446129
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-location:default.log.1766446129
-rw-r--r--   1 root     root         332 Dec 22 23:30 network-loopback:default.log.1766446130
-rw-r--r--   1 root     root         152 Dec 22 23:30 network-netcfg:default.log.1766446130
-rw-r--r--   1 root     root         162 Dec 22 23:30 network-netmask:default.log.1766446130
-rw-r--r--   1 root     root         275 Dec 22 23:30 network-physical:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-physical:nwam.log.1766446129
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-routing-legacy-routing:ipv4.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-routing-legacy-routing:ipv6.log.1766446134
-rw-r--r--   1 root     root         188 Dec 22 23:30 network-routing-ndp:default.log.1766446144
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-routing-rdisc:default.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-routing-ripng:default.log.1766446134
-rw-r--r--   1 root     root         120 Dec 22 23:30 network-routing-route:default.log.1766446134
-rw-r--r--   1 root     root         327 Dec 22 23:30 network-routing-setup:default.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-rpc-bind:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-rpc-keyserv:default.log.1766446129
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-security-kadmin:default.log.1766446133
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-security-krb5kdc:default.log.1766446133
-rw-r--r--   1 root     root         355 Dec 22 23:30 network-service:default.log.1766446133
-rw-r--r--   1 root     root         167 Dec 22 23:30 network-shares-group:default.log.1766446134
-rw-r--r--   1 root     root         163 Dec 22 23:30 network-shares-group:zfs.log.1766446134
-rw-r--r--   1 root     root          75 Dec 22 23:30 network-smb-client:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-smb-server:default.log.1766446130
-rw-r--r--   1 root     root         263 Dec 22 23:30 network-ssh:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 network-tcpkey:default.log.1766446129
-rw-r--r--   1 root     root           0 Dec 22 23:30 svc.startd.log.1766446128
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-auditd:default.log.1766446129
-rw-r--r--   1 root     root         163 Dec 22 23:30 system-auditset:default.log.1766446131
-rw-r--r--   1 root     root         170 Dec 22 23:30 system-boot-archive-update:default.log.1766446134
-rw-r--r--   1 root     root         163 Dec 22 23:30 system-boot-archive:default.log.1766446130
-rw-r--r--   1 root     root         166 Dec 22 23:30 system-boot-config:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-consadm:default.log.1766446131
-rw-r--r--   1 root     root         115 Dec 22 23:30 system-console-login:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-console-login:vt2.log.1766446131
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-console-login:vt3.log.1766446131
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-console-login:vt4.log.1766446131
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-console-login:vt5.log.1766446131
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-console-login:vt6.log.1766446131
-rw-r--r--   1 root     root         199 Dec 22 23:30 system-coreadm:default.log.1766446134
-rw-r--r--   1 root     root         204 Dec 22 23:30 system-cron:default.log.1766446134
-rw-r--r--   1 root     root         334 Dec 22 23:30 system-cryptosvc:default.log.1766446133
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-device-allocate:default.log.1766446130
-rw-r--r--   1 root     root         164 Dec 22 23:30 system-device-audio:default.log.1766446130
-rw-r--r--   1 root     root         241 Dec 22 23:30 system-device-local:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-device-mpxio-upgrade:default.log.1766446129
-rw-r--r--   1 root     root          60 Dec 22 23:30 system-early-manifest-import:default.log.1766446128
-rw-r--r--   1 root     root          75 Dec 22 23:30 system-extended-accounting:flow.log.1766446133
-rw-r--r--   1 root     root          75 Dec 22 23:30 system-extended-accounting:net.log.1766446133
-rw-r--r--   1 root     root          75 Dec 22 23:30 system-extended-accounting:process.log.1766446133
-rw-r--r--   1 root     root          75 Dec 22 23:30 system-extended-accounting:task.log.1766446133
-rw-r--r--   1 root     root         159 Dec 22 23:30 system-filesystem-local:default.log.1766446134
-rw-r--r--   1 root     root         308 Dec 22 23:30 system-filesystem-minimal:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-filesystem-reparse:default.log.1766446130
-rw-r--r--   1 root     root         158 Dec 22 23:30 system-filesystem-root:default.log.1766446129
-rw-r--r--   1 root     root         218 Dec 22 23:30 system-filesystem-usr:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-fm-notify-params:default.log.1766446130
-rw-r--r--   1 root     root         295 Dec 22 23:30 system-fmd:default.log.1766446131
-rw-r--r--   1 root     root         161 Dec 22 23:30 system-hostid:default.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-hotplug:default.log.1766446130
-rw-r--r--   1 root     root         326 Dec 22 23:30 system-identity:domain.log.1766446133
-rw-r--r--   1 root     root         164 Dec 22 23:30 system-identity:node.log.1766446130
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-idmap:default.log.1766446129
-rw-r--r--   1 root     root       81967 Dec 22 23:30 system-illumos-propolis-server:default.log.1766446195
-rw-r--r--   1 root     root         271 Dec 22 23:30 system-keymap:default.log.1766446133
-rw-r--r--   1 root     root         165 Dec 22 23:30 system-logadm-upgrade:default.log.1766446130
-rw-r--r--   1 root     root         288 Dec 22 23:30 system-manifest-import:default.log.1766446134
-rw-r--r--   1 root     root         273 Dec 22 23:30 system-name-service-cache:default.log.1766446133
-rw-r--r--   1 root     root         151 Dec 22 23:30 system-pfexec:default.log.1766446130
-rw-r--r--   1 root     root         155 Dec 22 23:30 system-pkgserv:default.log.1766446131
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-process-security:default.log.1766446130
-rw-r--r--   1 root     root         165 Dec 22 23:30 system-rbac:default.log.1766446133
-rw-r--r--   1 root     root         238 Dec 22 23:30 system-rmtmpfiles:default.log.1766446130
-rw-r--r--   1 root     root         179 Dec 22 23:30 system-sac:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-svc-global:default.log.1766446130
-rw-r--r--   1 root     root         206 Dec 22 23:30 system-system-log:default.log.1766446134
-rw-r--r--   1 root     root       70420 Dec 22 23:30 system-update-man-index:default.log.1766446134
-rw-r--r--   1 root     root         205 Dec 22 23:30 system-utmp:default.log.1766446134
-rw-r--r--   1 root     root          30 Dec 22 23:30 system-vtdaemon:default.log.1766446130

What I've yet to confirm:

any GZ logs
logs from zone shutdown of any control plane zone

davepacheco · 2025-12-22T23:43:14Z

I tried restarting sled-agent on sled 14, expecting to see that the restart would bounce control plane zones and archive their live log files, but I don't seem to see the live log files archived. I need to do more digging on that.

davepacheco · 2025-12-23T00:29:22Z

I watched the global zone files get rotated as expected:

BRM42220036 # ls -l /pool/ext/*/crypt/debug/global
/pool/ext/0544a89a-a2f8-494c-ace8-099b7fbd4e13/crypt/debug/global:
total 0

/pool/ext/0811ae71-4df3-49d5-9177-96ad02d13302/crypt/debug/global:
total 0
BRM42220036 # ls -l /var/svc/log/*.log.*
-rw-r--r--   1 root     root     4335276 Dec 22 23:47 /var/svc/log/oxide-mg-ddm:default.log.0
-rw-r--r--   1 root     root     15328015 Dec 22 23:47 /var/svc/log/oxide-sled-agent:default.log.0

then a few minutes later:

BRM42220036 # ls -l /pool/ext/*/crypt/debug/global
/pool/ext/0544a89a-a2f8-494c-ace8-099b7fbd4e13/crypt/debug/global:
total 0

/pool/ext/0811ae71-4df3-49d5-9177-96ad02d13302/crypt/debug/global:
total 3731
-rw-r--r--   1 root     root      170934 Dec 22 23:52 messages.1766446712
-rw-r--r--   1 root     root     4335276 Dec 22 23:52 oxide-mg-ddm:default.log.1766447265
-rw-r--r--   1 root     root     15328015 Dec 22 23:52 oxide-sled-agent:default.log.1766447265
BRM42220036 # ls -l /var/svc/log/*.log.*
/var/svc/log/*.log.*: No such file or directory

jgallagher

LGTM - happy to review again once this is out of draft.

jgallagher · 2026-01-13T21:27:01Z

sled-agent/config-reconciler/src/debug_collector/file_archiver/filesystem.rs

+
+#[derive(Debug, Error)]
+#[error("string is not a valid filename (has slashes or is '.' or '..')")]
+pub(crate) struct BadFilename;


Nit - maybe include the invalid filename itself?

jgallagher · 2026-01-13T21:30:18Z

sled-agent/config-reconciler/src/debug_collector/file_archiver/execution.rs

+
+    tokio::io::copy(&mut src_f, &mut dest_f).await?;
+
+    dest_f.sync_all().await?;


Do we also need to sync dest_f's parent directory?

This code hasn't changed, so it wouldn't be a regression, but I did go ahead and add this fsync.

davepacheco · 2026-01-13T23:33:10Z

I've discovered some minor confusion in this PR. For some reason that I did not write down, at some point I concluded incorrectly that Sled Agent writes kernel crash dumps to the "crash" dataset on the M.2. I suspect I concluded this because on dogfood, sled 13 has two crash dumps there:

BRM42220018 # ls -lE /pool/int/*/crash
/pool/int/0de9f77c-dfa4-4cbd-a6a8-eab074958fe7/crash:
total 0

/pool/int/35dcb885-18cf-4842-a17f-fb63e76a5f2c/crash:
total 25072563
-rw-r--r--   1 root     root           2 1986-12-28 00:00:48.286332156 +0000 bounds
-rw-r--r--   1 root     root     10597105664 1986-12-28 00:00:51.251426419 +0000 vmdump.0
-rw-r--r--   1 root     root     9466478592 1986-12-28 00:00:48.286226856 +0000 vmdump.1

If this were the normal process, then I would expect that both the old code (in main) and the new code (in this PR) would wind up clobbering all but the most recent crash dump in the debug dataset. That's because when savecore is run, it would write the dump to an empty cores directory, then archive it to the debug dataset, deleting the copy in the cores directory. The next time it happens, the same thing would happen, using the same filename (vmdump.0) because it wouldn't collide with another file in the cores dataset. But it would collide with a file in the debug dataset, and the existing code (which I copied forward here) overwrites an existing file.

But that's not what happens. The block comment:

omicron/sled-agent/config-reconciler/src/debug_collector/worker.rs

Lines 49 to 80 in e6961bc

    
           //!     +------------------------+  +--------------+      +---------------+ 
        
           //!     | dump device containing |  | user process |      | log files     | 
        
           //!     | kernel crash dump      |  +--------------+      | inside zones  | 
        
           //!     +------------------------+        |               +---------------+ 
        
           //!            |                          |                       | 
        
           //!            |                          | process crash:        | 
        
           //!            |                          | system writes         | 
        
           //!            | DebugCollector           | core dump to          | 
        
           //!            | invokes                  | configured            | 
        
           //!            | savecore(8)              | directory             | 
        
           //!            |                          |                       | 
        
           //!            |                          v                       | 
        
           //!            |        +--------------------------------------+  | 
        
           //!            |        |        chosen "core" dataset         |  | 
        
           //!            |        | (ZFS dataset on internal (M.2) disk) |  | 
        
           //!            |        +--------------------------------------+  | 
        
           //!            |                          |                       | 
        
           //!            |                          |                       | 
        
           //!            |                          |                       | 
        
           //!            |                          | DebugCollector        | 
        
           //!            |                          | periodically archives | 
        
           //!            |                          | the core dumps and    | 
        
           //!            |                          | log files (copies to  | 
        
           //!            |                          | debug dataset, then   | 
        
           //!            |                          | deletes the original) | 
        
           //!            |                          |                       | 
        
           //!            v                          v                       v 
        
           //!        +----------------------------------------------------------+ 
        
           //!        |    debug datasets (ZFS datasets on external (U.2) disks) |-+ 
        
           //!        +----------------------------------------------------------+ |-+ 
        
           //!          +----------------------------------------------------------+ | 
        
           //!            +----------------------------------------------------------+

seems to reflect what really happens, which is that we have savecore write the dump directly from the dump slice to the debug dataset. That means it will correctly avoid a filename collision.

Edit: you might wonder why sled 13 still has those dumps in that directory, instead of them being archived and deleted? The answer is that sled agent doesn't know about any U.2s because the control plane hasn't been initialized there because it hasn't been part of the control plane in years.

Commit 71e24bc updates this PR to remove a crash dump from the test data. @jgallagher does all this make sense to you? I'm just a little nervous because if I'm wrong, then I'm afraid the new code would clobber crash dumps where the old one wouldn't have. But I don't think that's the case based on my current understanding. (I'm still not sure how a crash dump got into a cores dataset on that one sled.)

jgallagher · 2026-01-14T17:21:32Z

Commit 71e24bc updates this PR to remove a crash dump from the test data. @jgallagher does all this make sense to you? I'm just a little nervous because if I'm wrong, then I'm afraid the new code would clobber crash dumps where the old one wouldn't have. But I don't think that's the case based on my current understanding.

Yeah, that all makes sense.

(I'm still not sure how a crash dump got into a cores dataset on that one sled.)

I don't think that sled has ever been a part of the (current) dogfood control plane; otherwise we'd have an entry in sled for it (even if it was expunged and decommissioned). It's possible it was in the original control plane before we shipped 1.0 (at which point we did a wipe + setup, in August 2023). I believe the very old compliance dump service could have written these, based on this comment and the old pilot implementation.

davepacheco · 2026-01-14T18:23:05Z

An important goal of this change is that the new automated tests cover much more of the end-to-end archival path than before. Still, I wanted to do some manual testing to make sure the last bits are wired up correctly. Here a summary of my manual test plan:

verify core dump flow:
- core generated on M2 dataset
- core then archived into debug dataset
- original on M2 dataset is removed
verify crash dump flow:
- crash dump written directly to debug dataset on boot
- non-GZ log files from before the panic appear in the debug dataset
verify presence of periodically rotated non-global-zone SMF log files (and originals are gone)
verify presence of periodically rotated GZ SMF log files (originals gone)
verify presence of periodically rotated /var/adm/messages file (original gone)
verify presence of periodically rotated non-global-zone /var/adm/messages (originals gone)

I installed e2a09c6 onto berlin with rkadm.

Core dump flow

I induced a few core dumps on sled 14:

BRM42220011 # cd /var/tmp
BRM42220011 # sleep 30
^\Quit (core dumped)
BRM42220011 # sleep 20
^\Quit (core dumped)

As expected, they showed up in the cores dataset on an M2:

BRM42220011 # ls -l /pool/int/*/crash/
/pool/int/27e11e12-1b24-4067-a4ef-8c2a754160fb/crash/:
total 0

/pool/int/ae50d73a-b206-4785-a55d-3f4b17acce95/crash/:
total 10786
-rw-------   1 root     root     2750323 Jan 12 23:39 core.global.sleep.11008.1768261153
-rw-------   1 root     root     2750323 Jan 12 23:39 core.global.sleep.11070.1768261157

BRM42220011 # pargs /pool/int/*/crash//core.global.sleep.*
core '/pool/int/ae50d73a-b206-4785-a55d-3f4b17acce95/crash/core.global.sleep.11008.1768261153' of 11008:        sleep 30
argv[0]: sleep
argv[1]: 30

core '/pool/int/ae50d73a-b206-4785-a55d-3f4b17acce95/crash/core.global.sleep.11070.1768261157' of 11070:        sleep 20
argv[0]: sleep
argv[1]: 20

A few seconds later, they were gone from there and archived onto a debug dataset
on a U2:

BRM42220011 # ls -l /pool/int/*/crash/
/pool/int/27e11e12-1b24-4067-a4ef-8c2a754160fb/crash/:
total 0

/pool/int/ae50d73a-b206-4785-a55d-3f4b17acce95/crash/:
total 0

BRM42220011 # ls /pool/ext/*/crypt/debug/core*
/pool/ext/14ab80eb-917c-47b7-9ee3-5ea73b06abe9/crypt/debug/core.global.sleep.11008.1768261153
/pool/ext/14ab80eb-917c-47b7-9ee3-5ea73b06abe9/crypt/debug/core.global.sleep.11070.1768261157

Crash dump / sled reboot flow

I induced a panic on sled 15:

BRM44220007 # date
Mon Jan 12 23:52:52 UTC 2026
BRM44220007 # mdb -wke 'clock/W 0'
clock:          0xe5894855      =       0x0

panic[cpu0]/thread=fffff78809c6ec20: BAD TRAP: type=e (#pf Page fault) rp=fffff78809c6e9f0 addr=100000001 occurred in module "<unknown>" due to an illegal access to a user address

sched: #pf Page fault
Bad kernel fault at addr=0x100000001
pid=0, pc=0x100000001, sp=0xfffff78809c6eae8, eflags=0x10292

On boot, when sled agent started up, it appeared idle (not starting zones), but it was actually running savecore. Using dumpadm I could see that it was configured to write the core file to a debug dataset:

BRM44220007 # dumpadm
      Dump content: kernel and current process pages
       Dump device: /devices/pci@0,0/pci1de,fff9@1,3/pci1344,3100@0/blkdev@w00A0750132748217,0:e (dedicated)
Savecore directory: /pool/ext/0cbc6f53-8b86-4a90-8482-61e8240fcdca/crypt/debug
  Savecore enabled: no
   Save compressed: on
BRM44220007 # ptree 656
656    ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxi
  657    /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
    723    /opt/oxide/sled-agent/tofino-monitor
    1094   /usr/bin/savecore -v

That eventually finished:

BRM44220007 # ls -lh /pool/ext/0cbc6f53-8b86-4a90-8482-61e8240fcdca/crypt/debug/vmdump.0
-rw-r--r--   1 root     root       9.48G Dec 28 00:04 /pool/ext/0cbc6f53-8b86-4a90-8482-61e8240fcdca/crypt/debug/vmdump.0

All of this happens before time is sync'd (since it hasn't started any zones, including the NTP zone). But a few minutes later, time did sync as expected.

A little while after this, I had these log files:

BRM44220007 # ls -l /pool/ext/*/crypt/debug/oxz_nexus_1ccc74a8-3beb-43ce-9d12-a59273283ef9/oxide-nexus*
-rw-r--r--   1 root     root     5381691 Jan 12 23:16 /pool/ext/0264034e-e243-44e1-a48d-6c8a0b1cb7be/crypt/debug/oxz_nexus_1ccc74a8-3beb-43ce-9d12-a59273283ef9/oxide-nexus:default.log.1768259698
-rw-r--r--   1 root     root     61865448 Dec 28  1986 /pool/ext/0cbc6f53-8b86-4a90-8482-61e8240fcdca/crypt/debug/oxz_nexus_1ccc74a8-3beb-43ce-9d12-a59273283ef9/oxide-nexus:default.1768262001
-rw-r--r--   1 root     root     5672155 Jan 13 00:16 /pool/ext/210ab515-2251-47a4-afdc-73897a3a3b6c/crypt/debug/oxz_nexus_1ccc74a8-3beb-43ce-9d12-a59273283ef9/oxide-nexus:default.log.1768263299

File /pool/ext/0cbc6f53-8b86-4a90-8482-61e8240fcdca/crypt/debug/oxz_nexus_1ccc74a8-3beb-43ce-9d12-a59273283ef9/oxide-nexus:default.1768262001 contains log entries from 2026-01-12T23:15:03.328Z through 2026-01-12T23:53:21.849Z. This leads right up to the panic above, which shows that we correctly archived log files from a previous OS boot.

Log files

To smoke check log archival without tying up a racklette for hours, I installed e2a09c6 in a4x2 and let it run overnight. I've got:

root@g0:~# ls -l /pool/ext/*/crypt/debug
/pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug:
total 3422
drwxr-xr-x   2 root     root           5 Jan 14 03:13 global
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_cockroachdb_af506837-7105-41f9-8f29-f8eda53e0763
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_cockroachdb_cff15986-b4da-4e60-b3d6-7e2bc7f47914
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_crucible_1dba0431-59cb-4b11-8ed1-714445035936
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_crucible_5c901576-ad9c-4124-9f1c-1389e6f2633c
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_crucible_8703b97d-3d07-4c32-963d-36cac5a6b0a8
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_crucible_d7294534-9bb0-4563-a595-6c27e0dbfd24
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_crucible_ff1453d8-9249-4ae0-a124-3ca4b21400ef
drwxr-xr-x   2 root     root         603 Jan 14 16:16 oxz_crucible_pantry_e80c3534-cca0-4c96-8a3d-81cd33b8b2aa
drwxr-xr-x   2 root     root         608 Jan 14 16:16 oxz_external_dns_cb052c9d-b5a0-4821-98a6-2921859f3857
drwxr-xr-x   2 root     root         603 Jan 14 16:16 oxz_internal_dns_32767510-7438-4b76-a12b-2c511d530036
drwxr-xr-x   2 root     root         615 Jan 14 16:46 oxz_nexus_d372dcf7-e713-4d98-8830-bce3d95783d9
drwxr-xr-x   2 root     root         618 Jan 14 16:16 oxz_ntp_1a7b0181-3aa1-47a3-bf72-c2344d6c4f61
drwxr-xr-x   2 root     root         657 Jan 14 17:31 oxz_switch

/pool/ext/8eca2211-9ba1-49b2-942c-20e1663d3c5d/crypt/debug:
total 0

/pool/ext/9369fd66-7b04-4da3-8071-a930f7127481/crypt/debug:
total 0

/pool/ext/bdb77016-f97b-4858-b6da-7d334df0e924/crypt/debug:
total 0

/pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/debug:
total 0

We've got archived global zone log files:

root@g0:~# ls -l /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/global
total 3891
-rw-r--r--   1 root     root       50067 Jan 14 03:13 messages.1768349402
-rw-r--r--   1 root     root     2658909 Jan 14 03:13 oxide-mg-ddm:default.log.1768360322
-rw-r--r--   1 root     root     17274638 Jan 14 03:13 oxide-sled-agent:default.log.1768360322

and their originals are gone:

root@g0:~# ls -l /var/svc/log/*.0
/var/svc/log/*.0: No such file or directory

root@g0:~# ls -l /var/adm/messages*
-rw-r--r--   1 root     root         129 Jan 14 17:11 /var/adm/messages

The counts of entries in the non-global-zone directories above show that we have hundreds of log files for these zones, including the switch zone. To give a sense of it:

root@g0:~# ls -l /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_switch
total 338247
-rw-r--r--   1 root     root          75 Jan 14 00:05 application-management-net-snmp:default.log.1768348694
-rw-r--r--   1 root     root           0 Jan 14 04:19 application-management-net-snmp:default.log.1768348925
-rw-r--r--   1 root     root           0 Jan 14 08:19 application-management-net-snmp:default.log.1768364104
-rw-r--r--   1 root     root           0 Jan 14 12:15 application-management-net-snmp:default.log.1768378506
-rw-r--r--   1 root     root           0 Jan 14 16:16 application-management-net-snmp:default.log.1768392927
-rw-r--r--   1 root     root          30 Jan 14 00:07 application-pkg-dynamic-mirror:default.log.1768348689
-rw-r--r--   1 root     root           0 Jan 14 04:19 application-pkg-dynamic-mirror:default.log.1768348925
-rw-r--r--   1 root     root           0 Jan 14 08:19 application-pkg-dynamic-mirror:default.log.1768364102
-rw-r--r--   1 root     root           0 Jan 14 12:15 application-pkg-dynamic-mirror:default.log.1768378500
-rw-r--r--   1 root     root           0 Jan 14 16:16 application-pkg-dynamic-mirror:default.log.1768392900
...
-rw-r--r--   1 root     root         263 Jan 14 00:07 network-ssh:default.log.1768348699
-rw-r--r--   1 root     root           0 Jan 14 04:19 network-ssh:default.log.1768348925
-rw-r--r--   1 root     root           0 Jan 14 08:19 network-ssh:default.log.1768364106
-rw-r--r--   1 root     root           0 Jan 14 12:15 network-ssh:default.log.1768378508
-rw-r--r--   1 root     root           0 Jan 14 16:16 network-ssh:default.log.1768392930
-rw-r--r--   1 root     root          30 Jan 14 00:07 network-tcpkey:default.log.1768348688
-rw-r--r--   1 root     root           0 Jan 14 04:19 network-tcpkey:default.log.1768348925
-rw-r--r--   1 root     root           0 Jan 14 08:19 network-tcpkey:default.log.1768364106
-rw-r--r--   1 root     root           0 Jan 14 12:15 network-tcpkey:default.log.1768378509
-rw-r--r--   1 root     root           0 Jan 14 16:16 network-tcpkey:default.log.1768392931
-rw-r--r--   1 root     root     1568256 Jan 14 00:08 oxide-dendrite:default.log.1768348925
-rw-r--r--   1 root     root     107399122 Jan 14 01:48 oxide-dendrite:default.log.1768355101
-rw-r--r--   1 root     root     111932760 Jan 14 03:33 oxide-dendrite:default.log.1768361401
-rw-r--r--   1 root     root     112002255 Jan 14 05:19 oxide-dendrite:default.log.1768367701
-rw-r--r--   1 root     root     112055056 Jan 14 07:04 oxide-dendrite:default.log.1768374000
-rw-r--r--   1 root     root     112049359 Jan 14 08:50 oxide-dendrite:default.log.1768380301
-rw-r--r--   1 root     root     112060330 Jan 14 10:30 oxide-dendrite:default.log.1768386600
-rw-r--r--   1 root     root     111998613 Jan 14 12:15 oxide-dendrite:default.log.1768392900
-rw-r--r--   1 root     root     111744396 Jan 14 14:01 oxide-dendrite:default.log.1768399200
-rw-r--r--   1 root     root     111939686 Jan 14 15:46 oxide-dendrite:default.log.1768405500
-rw-r--r--   1 root     root     111871270 Jan 14 17:31 oxide-dendrite:default.log.1768411802
-rw-r--r--   1 root     root       39280 Jan 14 00:06 oxide-lldpd:default.log.1768348718
-rw-r--r--   1 root     root      142831 Jan 14 04:19 oxide-lldpd:default.log.1768364096
-rw-r--r--   1 root     root      139866 Jan 14 08:19 oxide-lldpd:default.log.1768378498
-rw-r--r--   1 root     root      136837 Jan 14 12:15 oxide-lldpd:default.log.1768392899
-rw-r--r--   1 root     root      136123 Jan 14 16:16 oxide-lldpd:default.log.1768407292
-rw-r--r--   1 root     root       58290 Jan 14 00:08 oxide-mg-ddm:default.log.1768348918
-rw-r--r--   1 root     root     3606553 Jan 14 04:19 oxide-mg-ddm:default.log.1768364103
-rw-r--r--   1 root     root     3269559 Jan 14 08:19 oxide-mg-ddm:default.log.1768378508
-rw-r--r--   1 root     root     3174315 Jan 14 12:15 oxide-mg-ddm:default.log.1768392930
-rw-r--r--   1 root     root     2925092 Jan 14 16:16 oxide-mg-ddm:default.log.1768407306
...

There are no old rotated log files in the zone:

root@g0:~# ls -l /zone/oxz_switch/root/var/svc/log/*.0
/zone/oxz_switch/root/var/svc/log/*.0: No such file or directory

Same for an Omicron zone:

root@g0:~# ls -l $(svcs -z oxz_nexus_d372dcf7-e713-4d98-8830-bce3d95783d9 -L nexus)*
-rw-r--r--   1 root     root     2945970 Jan 14 18:17 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_nexus_d372dcf7-e713-4d98-8830-bce3d95783d9/root/var/svc/log/oxide-nexus:default.log

Similarly, these zones have no rotated syslog files still in them:

root@g0:~# ls -l /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_*/root/var/adm/messages*
-rw-r--r--   1 root     root           0 Jan 14 00:15 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_cockroachdb_af506837-7105-41f9-8f29-f8eda53e0763/root/var/adm/messages
-rw-r--r--   1 root     root           0 Jan 14 00:15 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_crucible_1dba0431-59cb-4b11-8ed1-714445035936/root/var/adm/messages
-rw-r--r--   1 root     root           0 Jan 14 00:15 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_crucible_pantry_e80c3534-cca0-4c96-8a3d-81cd33b8b2aa/root/var/adm/messages
-rw-r--r--   1 root     root           0 Jan 14 00:15 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_nexus_d372dcf7-e713-4d98-8830-bce3d95783d9/root/var/adm/messages
-rw-r--r--   1 root     root           0 Jan 14 00:02 /pool/ext/f83a8310-1fb4-41ea-a61e-c8e7b9fc148d/crypt/zone/oxz_ntp_1a7b0181-3aa1-47a3-bf72-c2344d6c4f61/root/var/adm/messages

though we do have archived ones:

root@g0:~# ls -l /pool/ext/*/crypt/debug/*/messages*
-rw-r--r--   1 root     root       50067 Jan 14 03:13 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/global/messages.1768349402
-rw-r--r--   1 root     root         176 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_cockroachdb_af506837-7105-41f9-8f29-f8eda53e0763/messages.1768349039
-rw-r--r--   1 root     root         176 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_cockroachdb_cff15986-b4da-4e60-b3d6-7e2bc7f47914/messages.1768349053
-rw-r--r--   1 root     root         173 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_1dba0431-59cb-4b11-8ed1-714445035936/messages.1768349328
-rw-r--r--   1 root     root         173 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_5c901576-ad9c-4124-9f1c-1389e6f2633c/messages.1768349327
-rw-r--r--   1 root     root         173 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_8703b97d-3d07-4c32-963d-36cac5a6b0a8/messages.1768349327
-rw-r--r--   1 root     root         173 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_d7294534-9bb0-4563-a595-6c27e0dbfd24/messages.1768349329
-rw-r--r--   1 root     root         173 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_ff1453d8-9249-4ae0-a124-3ca4b21400ef/messages.1768349328
-rw-r--r--   1 root     root         180 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_crucible_pantry_e80c3534-cca0-4c96-8a3d-81cd33b8b2aa/messages.1768349334
-rw-r--r--   1 root     root         583 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_external_dns_cb052c9d-b5a0-4821-98a6-2921859f3857/messages.1768349351
-rw-r--r--   1 root     root         176 Jan 14 00:08 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_internal_dns_32767510-7438-4b76-a12b-2c511d530036/messages.1768348725
-rw-r--r--   1 root     root         562 Jan 14 00:18 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_nexus_d372dcf7-e713-4d98-8830-bce3d95783d9/messages.1768349378
-rw-r--r--   1 root     root         553 Jan 14 00:08 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_ntp_1a7b0181-3aa1-47a3-bf72-c2344d6c4f61/messages.1768348750
-rw-r--r--   1 root     root         133 Jan 14 00:08 /pool/ext/33468c82-2b78-423f-aea1-d445cf462573/crypt/debug/oxz_switch/messages.1768348708

DebugCollector: first class file archival

0d8b945

davepacheco requested review from jgallagher and smklein December 20, 2025 04:35

davepacheco marked this pull request as draft December 20, 2025 04:37

fix link

372db0a

davepacheco added 4 commits December 22, 2025 16:40

fix archiving former zone roots

d50b954

logs

152f369

this is a better fix for the live file regression

deaf145

Merge branch 'main' into dap/debug-collector-files

e2a09c6

jgallagher reviewed Jan 13, 2026

View reviewed changes

crash dumps do not go to "crash" dataset

71e24bc

davepacheco marked this pull request as ready for review January 14, 2026 18:24

review feedback

e7aa605

jgallagher approved these changes Jan 14, 2026

View reviewed changes

davepacheco enabled auto-merge (squash) January 14, 2026 21:13

davepacheco merged commit 185b7c6 into main Jan 14, 2026
17 checks passed

davepacheco deleted the dap/debug-collector-files branch January 14, 2026 21:16


		tokio::io::copy(&mut src_f, &mut dest_f).await?;

		dest_f.sync_all().await?;

DebugCollector: first class file archival #9555

DebugCollector: first class file archival #9555

Uh oh!

Conversation

davepacheco commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davepacheco commented Dec 20, 2025

Uh oh!

davepacheco commented Dec 22, 2025

Uh oh!

davepacheco commented Dec 22, 2025

Uh oh!

davepacheco commented Dec 22, 2025

Uh oh!

davepacheco commented Dec 23, 2025

Uh oh!

jgallagher left a comment

Choose a reason for hiding this comment

Uh oh!

jgallagher Jan 13, 2026

Choose a reason for hiding this comment

Uh oh!

jgallagher Jan 13, 2026

Choose a reason for hiding this comment

Uh oh!

davepacheco Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

davepacheco commented Jan 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgallagher commented Jan 14, 2026

Uh oh!

davepacheco commented Jan 14, 2026

Core dump flow

Crash dump / sled reboot flow

Log files

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davepacheco commented Dec 20, 2025 •

edited

Loading

davepacheco commented Jan 13, 2026 •

edited

Loading